Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement (2401.01750v2)

Published 3 Jan 2024 in cs.CV

Abstract: The attention mechanism has proven effective on a variety of visual tasks in recent years. In semantic segmentation, attention is used in many methods built on both Convolutional Neural Network (CNN) and Vision Transformer (ViT) backbones. However, we observe that the attention mechanism is vulnerable to patch-based adversarial attacks. Through an analysis of the effective receptive field, we attribute this vulnerability to the wide receptive field induced by global attention, which allows the influence of an adversarial patch to spread across the image. To address this issue, we propose a Robust Attention Mechanism (RAM) that improves the robustness of semantic segmentation models and notably mitigates their vulnerability to patch-based attacks. Compared to the vanilla attention mechanism, RAM introduces two novel modules, Max Attention Suppression and Random Attention Dropout, both of which refine the attention matrix and limit the influence of a single adversarial patch on the segmentation results at other positions. Extensive experiments demonstrate the effectiveness of RAM in improving the robustness of semantic segmentation models against various patch-based attack methods under different attack settings.
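The abstract only names the two modules, so the following is a minimal sketch of how such attention refinement could look, not the paper's actual formulation. It assumes Max Attention Suppression caps each post-softmax attention weight at an upper bound and Random Attention Dropout randomly zeroes attention entries, with renormalization after both steps; the hyperparameters `tau` and `drop_p` are illustrative, not values from the paper.

```python
import torch


def robust_attention(q, k, v, tau=0.1, drop_p=0.1, training=True):
    """Scaled dot-product attention with two hedged refinements
    inspired by the abstract: Max Attention Suppression and
    Random Attention Dropout. `tau` and `drop_p` are hypothetical.
    q, k, v: tensors of shape (..., seq_len, head_dim)."""
    d = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)

    # Max Attention Suppression: cap the weight any single position can
    # receive, so one adversarial patch cannot dominate a query's output.
    attn = attn.clamp(max=tau)

    # Random Attention Dropout: randomly sever attention links so a
    # patch cannot reliably reach every other position.
    if training:
        mask = (torch.rand_like(attn) >= drop_p).to(attn.dtype)
        attn = attn * mask

    # Renormalize so each query's weights again sum to one.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return attn @ v
```

In this sketch, clamping plus renormalization keeps each attention row a valid probability distribution, so the refinement could in principle replace the attention step of an existing layer without changing its interface.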

[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 
4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. 
[2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. 
arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. 
[2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. 
In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. 
In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: SeMask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan, Y., Wang, J.: OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety co-located with AAAI-19, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. 
[2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 
80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 
2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. 
In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 
80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  2. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR, pp. 8697–8710 (2018) Tan and Le [2021] Tan, M., Le, Q.V.: Efficientnetv2: Smaller models and faster training. In: ICML, pp. 10096–10106 (2021) Chen et al. [2015] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015) Chen et al. [2018] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4), 834–848 (2018) Chen et al. [2017] Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Tan, M., Le, Q.V.: Efficientnetv2: Smaller models and faster training. In: ICML, pp. 10096–10106 (2021) Chen et al. [2015] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015) Chen et al. [2018] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4), 834–848 (2018) Chen et al. [2017] Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. 
[2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 
6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015) Chen et al. [2018] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4), 834–848 (2018) Chen et al. [2017] Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 
8825–8835 (2020)
Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020)
Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017)
Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)
Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: SeMask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan and Wang [2018] Yuan, Y., Wang, J.: OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: AAAI-19 Workshop on Artificial Intelligence Safety, vol. 2301 (2019)
Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky [2021] Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood et al. [2021] Mahmood, K., Mahmood, R., van Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)
Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 
369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. 
In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. 
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: AAAI-19 Workshop on Artificial Intelligence Safety, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 
7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. 
In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. 
In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 
6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
[2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 
284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 
4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 
In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  4. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015) Chen et al. [2018] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4), 834–848 (2018) Chen et al. [2017] Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. 
[2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI 40(4), 834–848 (2018) Chen et al. [2017] Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. 
[2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 
7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Debenedetti et al.
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. 
In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 
80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. 
[2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 
1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. 
[2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 
In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. 
arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
[2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 
284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. 
In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022)
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  6. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) Liu et al. [2017] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: CVPR, pp. 6738–6746 (2017) Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. 
In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: Cosface: Large margin cosine loss for deep face recognition. In: CVPR, pp. 5265–5274 (2018) Deng et al. [2019] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. [2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019) Xiao et al. 
[2021] Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. 
[2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 
1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. 
[2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 
3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. 
In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. 
In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. 
In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. 
In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. 
In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
[2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 
284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. 
In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. 
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 
603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 
6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 
7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  9. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR, pp. 4690–4699 (2019)
  10. Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021)
  11. Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022)
  12. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
  13. Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021)
  14. Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019)
  15. Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020)
  16. Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020)
  17. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017)
  18. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)
  19. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
  20. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
  21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
  22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
  23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
  24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
  27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
  28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
  29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
  30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
  31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
  32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
  33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
  34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
  35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety (AAAI-19), vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks.
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. 
In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. 
In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. 
In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
[2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 
284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. 
arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 
1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. 
In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. 
arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. 
[2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 
284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. 
In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan, Y., Wang, J.: OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: AAAI Workshop on Artificial Intelligence Safety, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  10. Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., Zhu, J.: Improving transferability of adversarial patches on face recognition with generative models. In: CVPR, pp. 11845–11854 (2021) Wei et al. [2022] Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022) Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. 
[2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. 
[2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 
7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
11. Wei, X., Guo, Y., Yu, J.: Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI (2022)
12. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
13. Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021)
14. Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019)
15. Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020)
16. Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020)
17. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017)
18. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)
19. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
20. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. 
[2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  12. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018) Serrurier et al. [2021] Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. 
In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. 
[2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 
2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 
7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. 
In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. 
In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 
10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  13. Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., Barrio, E.: Achieving robustness in classification using optimal transport with hinge regularization. In: CVPR, pp. 505–514 (2021) Wang et al. [2019] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML, vol. 97, pp. 6586–6595 (2019) Kamann and Rother [2020a] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020) Kamann and Rother [2020b] Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. 
[2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 
6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. 
In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 
80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. 
[2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. 
[2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 
2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. 
[2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 
2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 
7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 
25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
  15. Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: CVPR, pp. 8825–8835 (2020)
  16. Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020)
  17. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017)
  18. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)
  19. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
  20. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
  21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
  22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
  23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
  24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
  25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
  27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
  28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
  29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
  30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
  31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
  32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
  33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
  34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
  35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020) Xie et al. [2017] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 
17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017) Michaelis et al. [2019] Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019) Zhou et al. [2017] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. 
In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 
9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 
12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. 
[2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 
6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. 
In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. 
arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
16. Kamann, C., Rother, C.: Increasing the robustness of semantic segmentation models with painting-by-numbers. In: ECCV, vol. 12355, pp. 369–387 (2020)
17. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: ICCV, pp. 1378–1387 (2017)
18. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., Brendel, W.: Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019)
19. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
20. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017) Xiao et al. [2018] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. 
[2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. 
arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018) Liu et al. [2022] Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022) Jain et al. [2021] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. 
arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 
603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 
6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 
7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. 
In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. 
In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 
6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. 
In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. 
arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
20. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV, vol. 11209, pp. 432–448 (2018)
21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021) Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. 
[2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021) Strudel et al. [2021] Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. 
[2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. 
[2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. 
In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. 
In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. 
[2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
21. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR, pp. 11966–11976 (2022)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. 
[2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 
9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
22. Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782 (2021)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021) Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
[2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 
611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015) Zhao et al. [2017] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017) Wang et al. [2018] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. 
In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018) Yuan and Wang [2018] Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., van Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Yuan, Y., Wang, J.: OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV, pp. 9992–10002 (2021)
24. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7242–7252 (2021)
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. 
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: AAAI-19 Workshop on Artificial Intelligence Safety (SafeAI), vol. 2301 (2019)
Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018) Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. 
[2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 
80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), August 11–13, 2021, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019) Li et al. [2019] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019) Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. 
[2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. 
In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. 
arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
27. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR, pp. 6230–6239 (2017)
28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
29. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. 
arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 
404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021) Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 
3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021) Cheng et al. [2021] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021) Cheng et al. [2022] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. 
[2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022) Brown et al. [2017] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017) Karmon et al. [2018] Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. 
[2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 
681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, vol. 2301 (2019)
Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  30. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: ICCV, pp. 603–612 (2019)
  31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
  32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
  33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
  34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
  35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 
10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. 
[2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 
13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  31. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV, pp. 9166–9175 (2019)
  32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
  33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
  34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
  35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018) Yang et al. [2020] Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 
2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. 
[2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. 
[2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. 
In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
32. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR, pp. 6881–6890 (2021)
33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020) Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. 
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. 
[2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. 
In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  33. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS, pp. 12077–12090 (2021)
  34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
  35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. 
arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 
15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. 
arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. 
[2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 
404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. 
arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 
26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
34. Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS, pp. 17864–17875 (2021)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, AAAI-19, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., van Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
35. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1280–1289 (2022)
36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
37. Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with AAAI-19, Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
484–501 (2020) Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019) Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. 
In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. 
[2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  36. Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
  37. Karmon, D., Zoran, D., Goldberg, Y.: LaVAN: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: PatchAttack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety (SafeAI@AAAI-19), vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., van Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp.
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. 
In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020)
  37. Karmon, D., Zoran, D., Goldberg, Y.: Lavan: Localized and visible adversarial noise. In: ICML, vol. 80, pp. 2512–2520 (2018)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: An adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety, co-located with AAAI-19, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The Pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. 
In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
[2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. 
In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 
119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  38. Yang, C., Kortylewski, A., Xie, C., Cao, Y., Yuille, A.L.: Patchattack: A black-box texture-based attack with reinforcement learning. In: ECCV, vol. 12371, pp. 681–698 (2020)
  39. Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019 Co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
  41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. 
In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. 
In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. 
Liu et al. [2019] Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., Li, H.: DPATCH: an adversarial patch attack on object detectors. In: Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019, vol. 2301 (2019)
Lee and Kolter [2019] Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019)
Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 
10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. 
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 
15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  40. Lee, M., Kolter, J.Z.: On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897 (2019) Hu et al. [2021] Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021) Sato et al. [2021] Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021) Nakka and Salzmann [2020] Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020) Mirsky [2021] Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021) Nesti et al. [2022] Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022) Athalye et al. [2018] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
41. Hu, Y., Chen, J., Kung, B., Hua, K., Tan, D.S.: Naturalistic physical adversarial patch for object detectors. In: ICCV, pp. 7828–7837 (2021)
42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018) Fu et al. [2022] Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022) Lovisotto et al. [2022] Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. 
In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. 
[2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? 
arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. 
In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
[2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 
119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  42. Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., Chen, Q.A.: Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In: 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 3309–3326 (2021)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. 
[2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. 
[2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. 
In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 
13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020)
  43. Nakka, K.K., Salzmann, M.: Indirect local attacks for context-aware semantic segmentation networks. In: ECCV, vol. 12350, pp. 611–628 (2020)
  44. Mirsky, Y.: IPatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 
7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 
12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 
12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
44. Mirsky, Y.: Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113 (2021)
45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022) Kirillov et al. [2019] Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. 
arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. 
[2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. 
In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. 
arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016)
  45. Nesti, F., Rossolini, G., Nair, S., Biondi, A., Buttazzo, G.C.: Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In: WACV, pp. 2826–2835 (2022)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. 
arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. 
[2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  46. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, vol. 80, pp. 284–293 (2018)
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-Fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: MetaFormer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and MLP-Mixer to CNNs. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can CNNs be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? 
In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. 
[2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. 
  47. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: Are vision transformers always robust against adversarial perturbations? In: ICLR (2022)
  48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
  50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 
2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
48. Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C.K., Metzen, J.H.: Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In: CVPR, pp. 15213–15222 (2022)
49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019)
50. Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022)
51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  49. Kirillov, A., Girshick, R.B., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR, pp. 6399–6408 (2019) Yu et al. [2022] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR, pp. 10809–10819 (2022) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. 
[2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021) Benz et al. [2021] Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. 
[2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. 
[2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. 
[2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. 
[2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. 
[2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. 
[2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. 
IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  51. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021)
  53. Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. 
[2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 
15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. 
[2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  52. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In: BMVC, p. 25 (2021) Wang et al. [2022] Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 
3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Wang, Z., Bai, Y., Zhou, Y., Xie, C.: Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452 (2022) Bai et al. [2021] Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021) Mahmood et al. [2021] Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021) Bhojanapalli et al. [2021] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. 
In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021) Bai et al. [2022] Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. 
[2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022) Shao et al. [2022] Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022) Debenedetti et al. [2022] Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022) Wu et al. [2022] Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022) Rando et al. [2022] Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022) Herrmann et al. [2022] Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves vit performance. In: CVPR, pp. 13409–13419 (2022) Salman et al. [2022] Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022) Chen et al. [2022] Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022) Huang and Li [2021] Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021) Mao et al. [2022] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022) Gu et al. [2022] Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022) Luo et al. [2016] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016) Srivastava et al. [2014] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014) Everingham et al. [2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 
2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  54. Bai, Y., Mei, J., Yuille, A.L., Xie, C.: Are transformers more robust than cnns? In: NeurIPS, pp. 26831–26843 (2021)
  55. Mahmood, K., Mahmood, R., Dijk, M.: On the robustness of vision transformers to adversarial examples. In: ICCV, pp. 7818–7827 (2021)
  56. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV, pp. 10211–10221 (2021)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  57. Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., Liu, W.: Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993 (2022)
  58. Shao, R., Shi, Z., Yi, J., Chen, P.-Y., Hsieh, C.-J.: On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670 (2022)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
[2015] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015) Cordts et al. [2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016) Croce and Hein [2020a] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020) Croce and Hein [2020b] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020) Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020) Andriushchenko et al. [2020] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 
484–501 (2020) Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
  59. Debenedetti, E., Sehwag, V., Mittal, P.: A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399 (2022)
  60. Wu, B., Gu, J., Li, Z., Cai, D., He, X., Liu, W.: Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498 (2022)
  61. Rando, J., Naimi, N., Baumann, T., Mathys, M.: Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761 (2022)
  62. Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., Sun, D.: Pyramid adversarial training improves ViT performance. In: CVPR, pp. 13409–13419 (2022)
  63. Salman, H., Jain, S., Wong, E., Madry, A.: Certified patch robustness via smoothed vision transformers. In: CVPR, pp. 15116–15126 (2022)
  64. Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., Zhang, W.: Towards practical certifiable patch defense with vision transformer. In: CVPR, pp. 15127–15137 (2022)
  65. Huang, Y., Li, Y.: Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481 (2021)
  66. Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards robust vision transformer. In: CVPR, pp. 12032–12041 (2022)
  67. Gu, J., Tresp, V., Qin, Y.: Are vision transformers robust to patch perturbations? In: ECCV, vol. 13672, pp. 404–421 (2022)
  68. Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp. 4898–4906 (2016)
  69. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  70. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV 111(1), 98–136 (2015)
  71. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR, pp. 3213–3223 (2016)
  72. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, vol. 119, pp. 2206–2216 (2020)
  73. Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: ICML, vol. 119, pp. 2196–2205 (2020)
  74. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: A query-efficient black-box adversarial attack via random search. In: ECCV, vol. 12368, pp. 484–501 (2020)
